Abstract: Many applications require data processing to be
performed on individual pieces of data of finite
size, e.g., files in cloud storage units and packets in data
networks. However, traditional universal compression solutions
do not perform well on such finite-length sequences. Recently,
we proposed a framework called memory-assisted universal
compression that holds significant promise for reducing the
amount of redundant data in finite-length sequences. The
proposed compression scheme is based on the observation that 
it is possible to learn source statistics (by memorizing previous 
sequences from the source) at some intermediate entities and 
then leverage the memorized context to reduce redundancy of 
the universal compression of finite-length sequences. We first 
present the fundamental gain of the proposed memory-assisted 
universal source coding over conventional universal compression 
(without memorization) for a single parametric source. Then, 
we extend and investigate the benefits of the memory-assisted 
universal source coding when the data sequences are generated 
by a compound source which is a mixture of parametric sources. 
We further develop a clustering technique within the memory- 
assisted compression framework to better utilize the memory 
by classifying the observed data sequences from a mixture of 
parametric sources. Finally, we demonstrate through computer 
simulations that the proposed joint memorization and cluster- 
ing technique can achieve up to 6-fold improvement over the 
traditional universal compression technique when a mixture of 
non-binary Markov sources is considered. 

I. Introduction 

Since Shannon's seminal work on the analysis of commu- 
nication systems, many researchers have contributed toward 
the development of compression schemes with the average 
code length as close as possible to the entropy. In practice, 
we usually cannot assume a priori knowledge on the statistics 
of the source although we still wish to compress the unknown 
stationary ergodic source to its entropy rate. This is known as 
the universal compression problem [1]-[3]. Unfortunately,
universality imposes an inevitable redundancy that depends
on the richness of the class of sources with respect
to which the code is universal [4]-[6]. While an entire library
of concatenated sequences from the same context (i.e., source 
model) can usually be encoded to less than a tenth of the 
original size using universal compression [0, @, it is usually 
not an option to concatenate and compress the entire library 
at once. On the other hand, when an individual sequence 
is universally compressed regardless of other sequences, the 
performance is fundamentally limited @, 0. 

Fig. 1. Memory-assisted compression in a two-hop communication scenario.

In 0, the authors observed that forming a statistical model
from training data would improve the performance of
universal compression on finite-length sequences. In [8], we
introduced memory-assisted universal source coding, where 
we proposed memorization of the previously seen sequences 
as a solution that can fundamentally improve the performance 
of universal compression. As an application of memory-assisted
compression, we introduced the notion of network
compression in [9], [10]. It was shown that by deploying
memory in the network (i.e., enabling some nodes to memorize
source sequences), we may remove the redundancy in the
network traffic. In [10], we assumed that memorization of
the previous sequences from the same source provides a fundamental
gain g over and above the conventional compression
performance of the universal compression of a new sequence
from the same source. Given g, we derived the network-wide
memorization gain Q on both a random network graph [9]
and a power-law network graph [10] when a small fraction
of the nodes in the network are capable of memorization.
However, [9], [10] did not explain how g is computed.

Although memory-assisted universal source coding arises
naturally in a variety of problems, we define the problem
setup in the most basic network scenario, depicted in Fig. 1.
We assume that the network consists of the server S, the
intermediate (relay) node R, and the client C, where S wishes
to send the sequence $x^n$ to C. We assume that C has had no
prior communication with the server, and hence, is not
capable of memorizing the source context. However, as an
intermediate node, R has observed several previous sequences
from S when forwarding them from S to clients other than
C (not shown in Fig. 1). Therefore, R has formed a memory
of the previous communications shared with S. Note that if
the intermediate node R were absent, the source could still
apply universal compression to $x^n$ and transmit it to C, whereas
the presence of the memorized sequences at R can potentially
reduce the communication overhead on the S-R link.

The objective of the present paper is to characterize the 
fundamental gain g of memorization of the context from a 
server's previous sequences in the universal compression of 
a new individual sequence from the same server. Clearly,
a single stationary ergodic source does not fully model a
real content generator server (for example, the CNN news
website on the Internet). Instead, a better model is to view
every content generator server as a compound (mixture) of
several information sources whose true statistical models are
not readily available. In this work, we address this
issue and propose a memorization and clustering technique for
compression that is suitable for a compound source. Namely,
we would like to answer the following questions in the above
setup: 1) Would the deployment of memory in the encoder
(S) and the decoder (R) provide any fundamental benefit in
universal compression? 2) If so, how does this gain g
vary as the sequence length n and the memorized context
length m change? 3) How much performance improvement
should we expect from joint memorization and clustering
versus memorization without clustering? 4) How should
we realize the clustering scheme to achieve good performance
from compression with joint memorization and clustering?

II. Background Review and Motivation 

In this section, we motivate the context memorization prob- 
lem by demonstrating the significance of redundancy in the 
universal compression of small to moderate length sequences. 
Let $\mathcal{A}$ be a finite alphabet. Let the parametric source be defined
using a d-dimensional parameter vector $\theta = (\theta_1, \ldots, \theta_d)$, where
d denotes the number of source parameters. Denote $\mu_\theta$
as the probability measure defined by the parameter vector
$\theta$ on sequences of length n. We also use the notation $\mu_\theta$ to
refer to the parametric source itself. We assume that the d
parameters are unknown. Denote $\mathcal{P}^d$ as the family of sources
with d-dimensional unknown parameter vector $\theta$. We use the
notation $x^n = (x_1, \ldots, x_n) \in \mathcal{A}^n$ to represent a sequence of
length n from the alphabet $\mathcal{A}$.

Let $H_n(\theta)$ be the source entropy given $\theta$, i.e.,

$$H_n(\theta) = \mathbf{E} \log\left(\frac{1}{\mu_\theta(X^n)}\right) = \sum_{x^n} \mu_\theta(x^n) \log\left(\frac{1}{\mu_\theta(x^n)}\right).$$

In this paper $\log(\cdot)$ always denotes the logarithm in base 2. Let
$c_n : \mathcal{A}^n \to \{0, 1\}^*$ be an injective mapping from the set $\mathcal{A}^n$ of
the sequences of length n over $\mathcal{A}$ to the set $\{0, 1\}^*$ of binary
sequences. Further, let $l_n(x^n)$ denote a universal length function
for the codeword associated with the sequence $x^n$. Denote
$R_n(l_n, \theta)$ as the expected redundancy of the code with length
function $l_n(\cdot)$, defined as $R_n(l_n, \theta) = \mathbf{E}\, l_n(X^n) - H_n(\theta)$. Note
that the expected redundancy is always non-negative.
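To make the definition of expected redundancy concrete, the following minimal sketch evaluates $\mathbf{E}\, l_n(X^n) - H_n(\theta)$ exactly for a binary memoryless source (d = 1). It assumes the Krichevsky-Trofimov (KT) mixture as the universal length function and uses ideal (non-integer) code lengths; the KT code and all function names are illustrative choices that do not come from this paper.

```python
import math


def binary_entropy(theta):
    """Entropy in bits of one Bernoulli(theta) symbol."""
    if theta in (0.0, 1.0):
        return 0.0
    return -theta * math.log2(theta) - (1 - theta) * math.log2(1 - theta)


def kt_code_length(ones, zeros):
    """Ideal KT code length in bits for a binary sequence with the given counts.

    The KT probability is Gamma(a + 1/2) * Gamma(b + 1/2) / (pi * Gamma(n + 1)).
    """
    n = ones + zeros
    log_prob = (math.lgamma(ones + 0.5) + math.lgamma(zeros + 0.5)
                - math.lgamma(n + 1) - math.log(math.pi))
    return -log_prob / math.log(2)


def expected_redundancy(n, theta):
    """Expected redundancy E l_n(X^n) - H_n(theta) of the KT code, 0 < theta < 1."""
    expected_length = 0.0
    for ones in range(n + 1):
        # Binomial probability of observing `ones` ones, computed in the log domain.
        log_pmf = (math.lgamma(n + 1) - math.lgamma(ones + 1)
                   - math.lgamma(n - ones + 1)
                   + ones * math.log(theta) + (n - ones) * math.log(1 - theta))
        expected_length += math.exp(log_pmf) * kt_code_length(ones, n - ones)
    return expected_length - n * binary_entropy(theta)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(expected_redundancy(n, 0.3), 3))
```

Under these assumptions the redundancy stays non-negative and grows slowly with n, which is the behavior the rest of this section quantifies for general parametric sources.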
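Let $I_n(\theta)$ be the Fisher information matrix, i.e.,

$$I_n(\theta) = \{I_n^{ij}(\theta)\} = \frac{1}{n \log e}\, \mathbf{E}\left\{ \frac{\partial^2}{\partial \theta_i\, \partial \theta_j} \log\left(\frac{1}{\mu_\theta(X^n)}\right) \right\}.$$

The Fisher information matrix quantifies the amount of information,
on average, that each symbol in a sample sequence
$x^n$ from the source conveys about the source parameters.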
Let Jeffreys' prior on the parameter vector $\theta$ be denoted by

$$\omega_J(\theta) = \frac{|I_n(\theta)|^{1/2}}{\int |I_n(\lambda)|^{1/2}\, d\lambda}.$$

Jeffreys' prior is optimal in the sense that the average minimax
redundancy is achieved when the parameter vector $\theta$ is assumed
to follow Jeffreys' prior [11]. Further, let $\bar{R}_n$ be the average
minimax redundancy, given by [11], [12]

$$\bar{R}_n = \frac{d}{2} \log\left(\frac{n}{2\pi e}\right) + \log \int |I_n(\theta)|^{1/2}\, d\theta + O\left(\frac{1}{n}\right).$$

1 Throughout this paper, expectations are taken over the random sequence
$X^n$ with respect to the probability measure $\mu_\theta$ unless otherwise stated.

Fig. 2. Lower bound on compression for at least 95% of the sources as
a function of sequence length n for different values of entropy rate $H_n(\theta)/n$.

In 0, we obtained a lower bound on the average redundancy
of the universal compression for the family of
conditional two-stage codes, where the unknown parameter is
first estimated and the sequence is encoded using the estimated
parameter, as follows 0:

Theorem 1 Assume that the parameter vector $\theta$ follows Jeffreys'
prior in the universal compression of the family of
parametric sources $\mathcal{P}^d$. Let $\delta$ be a real number. Then,

Theorem 1 can be viewed as a tighter variant of Theorem 1
of Merhav and Feder in [0] for parametric sources.

To demonstrate the significance of the above theorem, we
consider an example using a first-order Markov source with
alphabet size k = 256. This source may be represented using
d = 256 x 255 = 65280 parameters. Fig. 2 shows the average
number of bits per symbol required to compress the class of
first-order Markov sources, normalized to the entropy of the
sequence, for different values of the entropy rate in bits per source
symbol (per byte). In this figure, the curves demonstrate the
lower bound on the compression rate achievable for at least
95% of the sources, i.e., the probability measure of the sources
from this class that may be compressed with a redundancy
smaller than the curve is at most $\epsilon = 0.05$. As can be seen,
if the source entropy rate is 1 bit per byte ($H_n(\theta)/n = 1$),
the compression overhead is 38%, 16%, 5.5%, 1.7%, and
0.5% for sequences of lengths 256kB, 1MB, 4MB, 16MB,
and 64MB, respectively. Hence, we conclude that redundancy
is significant in the compression of finite-length low-entropy
sequences, such as Internet traffic. It is this redundancy
that we hope to remove using the memorization technique.
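As a quick check of the parameter count used in this example, the sketch below counts the free parameters of a Markov source: each of the k^order contexts carries a next-symbol distribution with k - 1 free probabilities. The function name and the `order` argument are illustrative assumptions.

```python
def markov_parameter_count(alphabet_size, order=1):
    """Free parameters of a Markov source of the given order.

    Each of the alphabet_size**order contexts has a next-symbol distribution
    over alphabet_size symbols, of which alphabet_size - 1 entries are free
    because the probabilities must sum to one.
    """
    return alphabet_size ** order * (alphabet_size - 1)


if __name__ == "__main__":
    # First-order Markov source over bytes, as in the example above.
    print(markov_parameter_count(256))
```

For k = 256 and order 1 this evaluates 256 x 255.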



III. Fundamental Gain of Context Memorization 

In this section, we present the problem setup and define
the context memorization gain. We assume that the compound
source comprises a mixture of K information sources.
Denote [K] as the set {1, ..., K}. As the first step, in this paper,
we assume that K is finite and fixed. We consider parametric
sources with $\theta^{(i)}$ as the parameter vector for source i
($i \in [K]$). As in [5], we assume that $\theta^{(i)} = (\theta_1^{(i)}, \theta_2^{(i)}, \ldots, \theta_d^{(i)})$
follows Jeffreys' prior for all $i \in [K]$. We consider the
following scenario. We assume that, in Fig. 1, both the encoder
(at S) and the decoder (at R) have access to a memory of
the previous T sequences from the compound source. Let
$\mathbf{m} = (n_0, \ldots, n_{T-1})$ denote the lengths of the previous T
sequences generated by S. Further, denote $\mathbf{y} = \{y^{n_j}(j)\}_{j=0}^{T-1}$
as the previous T sequences from S visited by the memory
unit R. Note that each of these sequences might be from a
different source model. We denote $\mathbf{p} = (p_1, \ldots, p_K)$, where
$\sum_{i=1}^{K} p_i = 1$, as the probability distribution according to
which the information sources in the compound source are selected
for sequence generation, i.e., source i is picked with
probability $p_i$. Let the random variable $Z_j$ denote the index of
the source that has generated the sequence $y^{n_j}(j)$, and hence,
$Z_j$ follows the distribution $\mathbf{p}$ over [K]. Therefore, at time step
j, sequence $y^{n_j}(j)$ is generated using the parameter vector
$\theta^{(Z_j)}$. Further, denote $\mathbf{Z}$ as the vector $\mathbf{Z} = (Z_0, \ldots, Z_{T-1})$. We
wish to compress the sequence $x^n$ with source index $Z_T$, when
both the encoder and the decoder have access to a realization
$\mathbf{y}$ of the random vector $\mathbf{Y}$. This setup, although very generic,
can occur in many applications. As the most basic example,
consider the communication scenario in Fig. 1. The presence
of memory $\mathbf{y}$ at R can be used by S to compress (via memory-assisted
source coding) the sequence $x^n$ which is requested by
client C from S. The compression can reduce the transmission
cost on the S-R link while being transparent to the client, i.e.,
R decodes the memory-assisted source code and then applies
conventional universal compression to $x^n$ and transmits to C.
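The compound-source setup described above can be sketched as a small sampler: each memorized sequence first draws a source index $Z_j$ according to $\mathbf{p}$ and is then generated by the selected first-order Markov source. This is only an illustration under simplifying assumptions: the transition probabilities are drawn by normalizing uniform random weights (not from Jeffreys' prior), source indices are 0-based, and every function name is hypothetical.

```python
import random


def random_markov_source(alphabet_size, rng):
    """Draw a random first-order transition matrix whose rows sum to one.

    Assumption: rows are normalized uniform weights, not Jeffreys-distributed.
    """
    rows = []
    for _ in range(alphabet_size):
        weights = [rng.random() for _ in range(alphabet_size)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def sample_sequence(transition, length, rng):
    """Generate `length` >= 1 symbols from a first-order Markov chain."""
    k = len(transition)
    state = rng.randrange(k)  # uniform initial symbol
    symbols = [state]
    for _ in range(length - 1):
        state = rng.choices(range(k), weights=transition[state])[0]
        symbols.append(state)
    return symbols


def sample_memory(sources, p, lengths, rng):
    """Return (Z, y): source indices Z_j drawn from p and the sequences y^{n_j}(j)."""
    indices, sequences = [], []
    for n_j in lengths:
        z_j = rng.choices(range(len(sources)), weights=p)[0]
        indices.append(z_j)
        sequences.append(sample_sequence(sources[z_j], n_j, rng))
    return indices, sequences


if __name__ == "__main__":
    rng = random.Random(0)
    sources = [random_markov_source(4, rng) for _ in range(3)]
    Z, y = sample_memory(sources, [0.5, 0.3, 0.2], [10, 20, 30], rng)
    print(Z, [len(s) for s in y])
```

In UcompM only `y` would be available to the encoder and decoder, whereas UcompCM additionally assumes knowledge of `Z`.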

In order to investigate the fundamental gain of context
memorization in the memory-assisted universal compression
of the sequence $x^n$ over conventional universal source coding,
we compare the following three schemes.

• Ucomp (Universal compression), in which universal
compression alone is applied to the sequence $x^n$ without
regard to the memorized sequences $\mathbf{y}$.

• UcompM (Universal compression with context memorization),
in which the encoder S and the decoder R
both have access to the memorized sequences $\mathbf{y}$ from the
compound source, and they use $\mathbf{y}$ to learn the statistics
of the source for the compression of the sequence $x^n$.

• UcompCM (Universal compression with source-defined
clustering of the memory), which assumes that the memory
$\mathbf{y}$ is shared between the encoder S and the decoder
R (i.e., the memory unit). Further, the source-defined
clustering of memory implies that both S and R exactly
know the index vector $\mathbf{Z}$ of the memorized sequences.

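The performance of Ucomp is characterized using the expected
redundancy $R_n(l_n, \theta)$, which is discussed in Sec. II. Let
$Q(l_n, \tilde{l}_n, \theta)$ be defined as the ratio of the expected codeword
length with length function $l_n$ to that of $\tilde{l}_n$, i.e.,

$$Q(l_n, \tilde{l}_n, \theta) = \frac{\mathbf{E}\, l_n(X^n)}{\mathbf{E}\, \tilde{l}_n(X^n)} = \frac{H_n(\theta) + R_n(l_n, \theta)}{H_n(\theta) + R_n(\tilde{l}_n, \theta)}.$$

Further, let $\epsilon$ be such that $0 < \epsilon < 1$. We define $g(l_n, \tilde{l}_n, \theta, \epsilon)$ as
the gain of the length function $\tilde{l}_n$ as compared to $l_n$. That is,

$$g(l_n, \tilde{l}_n, \theta, \epsilon) = \sup_{z \in \mathbb{R}} \left\{ z : \mathbf{P}\left[ Q(l_n, \tilde{l}_n, \theta) \geq z \right] \geq 1 - \epsilon \right\}. \qquad (5)$$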



In the case of UcompM, let $l_{n|m}$ be the length function
with context memorization, where the encoder S and the
decoder R have access to sequences $\mathbf{y}$ with lengths $\mathbf{m}$. Let
$m = |\mathbf{m}| = \sum_{j=0}^{T-1} n_j$ denote the total length of memory (see footnote 2).
Further, let $\phi = \theta^{(Z_T)}$. Denote $R_n(l_{n|m}, \phi)$ as the expected
redundancy of encoding a sequence of length n from the parametric
source $\mu_\phi$ using the length function $l_{n|m}$. We denote
$g_M(n, m, \phi, \epsilon, \mathbf{p}) = \mathbf{E}_{\mathbf{Z}}\, g(l_n, l_{n|m}, \phi, \epsilon)$ as the fundamental
gain of context memorization on the family of parametric
sources $\mathcal{P}^d$ on a sequence of length n using context memory
lengths $\mathbf{m}$ for a fraction $(1 - \epsilon)$ of the sources. In other words,
context memorization provides a gain of at least $g_M(n, m, \phi, \epsilon, \mathbf{p})$
for a fraction $(1 - \epsilon)$ of the sources in the family.

Similarly, in the case of UcompCM, let $l_{n|m,\mathbf{Z}}$ denote the
length function for the universal compression of a sequence of
length n with memorized sequences $\mathbf{y}$, where the vector $\mathbf{Z}$ of
the source indices is known. We denote $g_{CM}(n, m, \phi, \epsilon, \mathbf{p}) =
\mathbf{E}_{\mathbf{Z}}\, g(l_n, l_{n|m,\mathbf{Z}}, \phi, \epsilon)$ as the fundamental gain of context
memorization in UcompCM. The following is a trivial lower
bound on the context memorization gain in UcompCM.

Fact 2 The fundamental gain of context memorization is:
$g_{CM}(n, m, \phi, \epsilon, \mathbf{p}) \geq 1$.

Fact 2 simply states that context memorization with source-defined
clustering does not worsen the performance of
universal compression. We stress again that the saving of
memory-assisted compression in terms of flow reduction is
only obtained on the S-R link. For example, for a given
memorization gain $g_{CM}(n, m, \phi, \epsilon, \mathbf{p}) = g_0$, the expected
number of bits needed to transfer $x^n$ to R is reduced from
$\mathbf{E}\, l_n(X^n)$ in Ucomp to $\frac{1}{g_0} \mathbf{E}\, l_n(X^n)$ in UcompCM.

IV. Results on the Memorization Gain

In this section, we present our main results on the memorization
gain with and without clustering. The proofs are
omitted due to lack of space. We give further consideration
to the case K = 1 since it represents the memorization gain
when all of the memorized sequences are from a single fixed
source model.
A. Case K = 1

In this case, since p = 1 and Z = 1 is known, there is no
distinction between UcompM and UcompCM, and hence, we
drop the subscript of g. The next theorem characterizes the
fundamental gain of memory-assisted source coding:

2 We assume that $n_j \gg h$, where h is the height of the tree of the class
$\mathcal{P}^d$, and hence, the impact of the concatenation of the sequences is negligible.



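Theorem 3 Assume that the parameter vector $\theta$ follows Jeffreys'
prior in the universal compression of the family of
parametric sources $\mathcal{P}^d$. Then,

$$g(n, m, \phi, \epsilon, 1) \geq 1 + \frac{\bar{R}_n + \log(\epsilon) - \hat{R}_1(n, m)}{H_n(\phi) + \hat{R}_1(n, m)} + O(\cdot),$$

where $\hat{R}_1(n, m) = \frac{d}{2} \log\left(1 + \frac{n}{m}\right) + 2$.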



Further, let $g(n, \infty, \phi, \epsilon, \mathbf{p})$ be defined as the achievable gain
of memorization when there is no constraint on the length of
the memory, i.e., $g(n, \infty, \phi, \epsilon, \mathbf{p}) = \lim_{m \to \infty} g(n, m, \phi, \epsilon, \mathbf{p})$.
The following corollary quantifies the memorization gain for
unbounded memory size.

Corollary 4 Assume that the parameter vector $\theta$ follows
Jeffreys' prior in the universal compression of the family of
parametric sources $\mathcal{P}^d$. Then,

$$g(n, \infty, \phi, \epsilon, 1) \geq 1 + \frac{\bar{R}_n + \log(\epsilon) - 2}{H_n(\phi)}.$$



Next, we consider the case where the sequence length n grows
to infinity. Intuitively, we would expect the memorization
gain to become negligible for the compression of long sequences.
Let $g(\infty, m, \phi, \epsilon, \mathbf{p}) = \lim_{n \to \infty} g(n, m, \phi, \epsilon, \mathbf{p})$. In the
following, we claim that memorization does not provide any
benefit when $n \to \infty$:

Proposition 5 $g(n, m, \phi, \epsilon, 1)$ approaches 1 as the length of
the sequence $x^n$ grows, i.e., $g(\infty, m, \phi, \epsilon, 1) = 1$.

B. UcompM: Case $K \geq 2$

As stated in the problem setup, the sequences in the memory
may be from various sources. This raises the question of
whether a naive memorization of the previous sequences
using UcompM, without regard to which source parameter has
generated each sequence, would suffice to achieve the
memorization gain. Let $\mathcal{D}$ be an upper bound on the size
of the context tree used in compression. Denote $\mu_{\tilde{\phi}}$ as the
probability measure that is defined on the tree of depth $\mathcal{D}$
from the mixture of the sources. Further, let

$$D_n(\mu_\phi \| \mu_{\tilde{\phi}}) = \sum_{x^n} \mu_\phi(x^n) \log\left(\frac{\mu_\phi(x^n)}{\mu_{\tilde{\phi}}(x^n)}\right).$$

The following proposition characterizes the performance of
UcompM when applied to a compound source for $K \geq 2$.

Proposition 6 Let $\theta^{(i)}$ ($i \in [K]$) follow Jeffreys' prior. Then,
the memorization gain in UcompM as $m \to \infty$ is upper
bounded by

$$g_M(n, \infty, \phi, \epsilon, \mathbf{p}) \leq \frac{H_n(\phi) + \bar{R}_n}{H_n(\phi) + D_n(\mu_\phi \| \mu_{\tilde{\phi}})} + O\left(\frac{1}{n}\right).$$

Note that since $K \geq 2$, $D_n(\mu_\phi \| \mu_{\tilde{\phi}}) = \Theta(n)$ (see footnote 3), unless
$\theta^{(i)} = \theta^{(j)}$ for all $i, j \in [K]$, which occurs with zero probability.
Therefore, the redundancy of UcompM is $R_n(l_{n|m}, \phi) = \Theta(n)$
with probability one.

3 $f(n) = \Theta(g(n))$ if and only if $f(n) = O(g(n))$ and $g(n) = O(f(n))$.

Proposition 6 states that when
the context is built using the mixture of the sources, with
probability one, the redundancy of UcompM is worse than the
redundancy of Ucomp for a sufficiently long sequence, i.e.,
the memorization gain becomes less than unity for sufficiently
large n. Therefore, the crude memorization of the context
by node R in Fig. 1 from the previous communications not
only fails to improve the compression performance but also
asymptotically makes it worse. We validate this claim
through simulations in Sec. VI.
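The linear-versus-logarithmic contrast behind this claim can be illustrated with a deliberately simplified sketch. It assumes binary memoryless sources instead of the context-tree sources considered in this paper, and it assumes that crude memorization of a mixture learns only the mixture-averaged symbol probability. Coding a sequence from one component with that averaged parameter costs n times a per-symbol Kullback-Leibler divergence, whereas the leading term of the universal redundancy without memory grows like (d/2) log n. All names and numbers below are illustrative.

```python
import math


def kl_bernoulli(p, q):
    """D(Bernoulli(p) || Bernoulli(q)) in bits per symbol, assuming 0 < q < 1."""
    total = 0.0
    for a, b in ((p, q), (1 - p, 1 - q)):
        if a > 0:
            total += a * math.log2(a / b)
    return total


def mismatch_penalty(n, theta_true, thetas, p):
    """Extra bits when n symbols from theta_true are coded with the mixture average."""
    theta_mix = sum(w * t for w, t in zip(p, thetas))
    return n * kl_bernoulli(theta_true, theta_mix)


def universal_redundancy_main_term(n, d=1):
    """Leading (d/2) log2(n) term of the universal redundancy; lower-order terms ignored."""
    return 0.5 * d * math.log2(n)


if __name__ == "__main__":
    thetas, p = [0.1, 0.3], [0.5, 0.5]
    for n in (10, 100, 1000, 10000):
        print(n,
              round(mismatch_penalty(n, thetas[0], thetas, p), 2),
              round(universal_redundancy_main_term(n), 2))
```

In this toy setting the mismatch penalty is below the universal redundancy term for short sequences and overtakes it as n grows, mirroring the statement that the gain of crude memorization eventually drops below one.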

C. UcompCM: Case $K \geq 2$

Thus far, we demonstrated in Proposition 6 that crude
memorization in the memory unit is not beneficial when a
compound source is present. This makes it necessary to first
appropriately cluster the sequences in the memory. Then, based on
which cluster the new sequence $x^n$ belongs
to, we utilize the corresponding memorized context for the
compression. In the following, we analyze the problem for
the source-defined clustering (defined in Sec. III). In this
clustering, we assume that both S and R (in Fig. 1) exactly
know the index $i \in [K]$, and hence, all the sequences that belong
to the same source $\theta^{(i)}$ in the compound source are assigned
to the same cluster (for all $i \in [K]$). Further, we assume that
S can exactly classify the new sequence $x^n$ to the cluster
with parameter $\theta^{(Z_T)}$. In Sec. V, however, we will relax these
assumptions and study the impact of clustering in practice.

Let $H(\mathbf{p}) = -\sum_{i=1}^{K} p_i \log(p_i)$ be the entropy of the source
model. The following theorem quantifies the achievable
redundancy and the memorization gain of UcompCM.

Theorem 7 Let $\theta^{(i)}$ ($i \in [K]$) follow Jeffreys' prior. Then, the
memorization gain of UcompCM is lower bounded by

$$g_{CM}(n, m, \phi, \epsilon, \mathbf{p}) \geq 1 + \frac{\bar{R}_n + \log(\epsilon) - \hat{R}_2(n, m)}{H_n(\phi) + \hat{R}_2(n, m)} + O(\cdot),$$

where $\hat{R}_2(n, m) = \frac{d}{2} \log\left(1 + \frac{n}{m}\right) + 3 + H(\mathbf{p})$.
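The two memory-dependent redundancy terms can be evaluated numerically. The sketch below assumes they read R1(n, m) = (d/2) log2(1 + n/m) + 2 for the single-source case and R2(n, m) = (d/2) log2(1 + n/m) + 3 + H(p) for UcompCM, which is how the theorem statements are interpreted here; the function names and the sample values are illustrative only.

```python
import math


def entropy_bits(p):
    """Entropy H(p) in bits of the source-selection distribution p."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def r_hat_1(n, m, d):
    """Assumed single-source term: (d/2) * log2(1 + n/m) + 2."""
    return 0.5 * d * math.log2(1 + n / m) + 2


def r_hat_2(n, m, d, p):
    """Assumed UcompCM term: (d/2) * log2(1 + n/m) + 3 + H(p)."""
    return 0.5 * d * math.log2(1 + n / m) + 3 + entropy_bits(p)


if __name__ == "__main__":
    d = 256 * 255                 # first-order Markov source over bytes
    n = 128 * 1024                # sequence length in symbols
    p = [0.1] * 10                # ten equiprobable sources
    for m in (2 ** 20, 8 * 2 ** 20, 64 * 2 ** 20):
        print(m, round(r_hat_1(n, m, d), 1), round(r_hat_2(n, m, d, p), 1))
```

Under this reading, the two terms differ by 1 + H(p) bits regardless of n and m, and both shrink toward a constant as the memory length m grows relative to n.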

V. Clustering for Memory-Assisted Compression 

In this section, we try to answer the main question in the 
memory-assisted compression setup we introduced: "How do 
we utilize the available memory to better compress a sequence 
generated by a compound source?" It is obvious that the 
performance of conventional universal compression schemes 
(those without memory) cannot be improved by clustering 
of the compound source as x n is encoded without regard 
to y. However, because of a compound source, clustering is 
necessary to effectively utilize the memory in the proposed 
memory-assisted compression. Within this framework, we 
identify two interrelated problems: 1) How do we perform
clustering of the memorized data to improve the performance
of memory-assisted compression? 2) Given a set of clustered
memory, how do we classify an incoming new sequence into
the cluster in the memory for which the performance
of memory-assisted compression is maximized? This relaxes
the assumption of knowing $\mathbf{Z}$ by the encoder and the decoder
in the analysis of Sec. IV.

Fig. 3. Theoretical lower bound on the memorization gain $g(n, m, \phi, 0.05, 1)$.

As one approach, it is natural to adapt a clustering algorithm,
among the many available, that has codelength minimization
as its principal criterion. Thus, the goal of the clustering is to
group the sequences in the memory such that the total length
of all the encoded sequences is minimized (i.e., the sequences
are grouped such that they are compressed well together). We
employ a Minimum Description Length (MDL) [13] approach
suggested by [14]. The MDL model selection approach is
based on finding the shortest description length of a given
sequence relative to a model class. We do not have a proof that
MDL clustering is necessarily optimal for our goal (for
all sequence lengths and memory sizes). However, as we will
see in Sec. VI, for the cases of interest where the length of
memory is larger than the length of the new sequence,
MDL clustering demonstrates very good performance, close
to that obtained when $\mathbf{Z}$ is known (source-defined clustering).

Now, we would like to find a proper class for a new
sequence $x^n$ generated by one of the K sources. Given a set
of T sequences taken from K different sources, we assume
those T sequences have already been clustered into K clusters
$C_1, \ldots, C_K$. Then, the classification algorithm for $x^n$ is as
follows. We include the sequence $x^n$ in each cluster one at
a time and find the total description length of all sequences
in the K clusters. Then, we label $x^n$ with the cluster whose
resulting total description length is the minimum.
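The classification rule just described can be sketched as follows. As a loud assumption, the description length of a cluster is approximated here by the zlib-compressed size of its concatenated sequences; this is only a convenient stand-in and is not the CTW-based coder used in the experiments of this paper. Cluster indices are 0-based and all function names are illustrative.

```python
import random
import zlib


def description_length(sequences):
    """Stand-in description length in bytes of one cluster.

    Assumption: zlib-compressed size of the concatenation; an empty cluster costs 0.
    """
    if not sequences:
        return 0
    return len(zlib.compress(b"".join(sequences), 9))


def total_description_length(clusters):
    """Sum of the description lengths of all clusters."""
    return sum(description_length(cluster) for cluster in clusters)


def classify(x, clusters):
    """Return the index of the cluster that yields the shortest total
    description length when x is tentatively added to it."""
    best_index, best_length = None, None
    for i in range(len(clusters)):
        trial = [c + [x] if j == i else c for j, c in enumerate(clusters)]
        length = total_description_length(trial)
        if best_length is None or length < best_length:
            best_index, best_length = i, length
    return best_index


def random_sequence(alphabet, length, rng):
    """Toy memoryless source: uniform symbols from a bytes alphabet."""
    return bytes(rng.choice(alphabet) for _ in range(length))


if __name__ == "__main__":
    rng = random.Random(0)
    clusters = [
        [random_sequence(b"ab", 2000, rng) for _ in range(3)],
        [random_sequence(b"wxyz", 2000, rng) for _ in range(3)],
    ]
    x = random_sequence(b"ab", 2000, rng)
    print("assigned cluster:", classify(x, clusters))
```

The tentative insertion is evaluated on copies of the clusters, so the memory itself is left unchanged until a label has been chosen.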

Next, we describe how we cluster the T sequences
in memory. A good clustering is one that allows efficient
compression of the whole data set. Equivalently, the sequences
that are clustered together should also compress well together.
We start with an initial clustering of the data set in the memory.
Through experiments, we observed that this initial clustering
has a considerable impact on the convergence of the clustering
algorithm, which is in accordance with the observation in [14].
We found that an initial clustering based on the estimated
entropy of the sequences greatly reduces the number of
iterations until convergence. The clustering is done iteratively
by sweeping through the data set and moving a sequence from cluster i
to cluster j if this move results in a shorter total description
length of the whole data set.
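A minimal sketch of this iterative procedure is given below. Several details are assumptions made only for illustration: the description length is approximated by zlib-compressed size rather than by the coder used in this paper, the entropy-based initialization sorts sequences by zeroth-order empirical entropy and splits them into K nearly equal groups, and the sweep stops when no move shortens the total description length or after a fixed number of sweeps.

```python
import math
import random
import zlib
from collections import Counter


def description_length(sequences):
    """Stand-in description length in bytes (assumption: zlib-compressed size)."""
    if not sequences:
        return 0
    return len(zlib.compress(b"".join(sequences), 9))


def empirical_entropy(seq):
    """Zeroth-order empirical entropy of a byte sequence in bits per symbol."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def initial_labels(sequences, k):
    """Assumed initialization: sort by entropy and cut into k nearly equal groups."""
    order = sorted(range(len(sequences)),
                   key=lambda i: empirical_entropy(sequences[i]))
    labels = [0] * len(sequences)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(sequences)
    return labels


def total_length(sequences, labels, k):
    """Total description length of all k clusters under the given labels."""
    return sum(
        description_length([s for s, lab in zip(sequences, labels) if lab == c])
        for c in range(k)
    )


def mdl_cluster(sequences, k, max_sweeps=10):
    """Greedy sweeps: move a sequence to another cluster whenever the move
    strictly shortens the total description length."""
    labels = initial_labels(sequences, k)
    best = total_length(sequences, labels, k)
    for _ in range(max_sweeps):
        moved = False
        for i in range(len(sequences)):
            for target in range(k):
                if target == labels[i]:
                    continue
                previous = labels[i]
                labels[i] = target
                candidate = total_length(sequences, labels, k)
                if candidate < best:
                    best, moved = candidate, True
                else:
                    labels[i] = previous  # undo a move that does not help
        if not moved:
            break
    return labels


def random_sequence(alphabet, length, rng):
    """Toy memoryless source: uniform symbols from a bytes alphabet."""
    return bytes(rng.choice(alphabet) for _ in range(length))


if __name__ == "__main__":
    rng = random.Random(1)
    seqs = []
    for _ in range(3):
        seqs.append(random_sequence(b"ab", 2000, rng))
        seqs.append(random_sequence(b"wxyz", 2000, rng))
    print(mdl_cluster(seqs, 2))
```

Each accepted move strictly decreases the total description length, so the sweeps terminate; the result is a local optimum that depends on the initialization, in line with the observation above about the impact of the initial clustering.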
Fig. 5. Theoretical and simulation results for $g_M$ and $g_{CM}$.

TABLE I

Memory-assisted compression gain under MDL clustering

                                  n = 10KB    n = 100KB
$g_{MDL}$ (m = 1MB)               1.7308      1.0372
$g_{MDL}$ (m = 10MB)              1.8862      1.0939
$g_{MDL}/g_{CM}$ (m = 1MB)        0.944       0.993
$g_{MDL}/g_{CM}$ (m = 10MB)       0.939       0.998



VI. Simulation Results and Conclusion 
In this section, we characterize the performance of the
proposed memorization scheme through computer simulations.
In order to illustrate the importance of clustering for efficient
use of memory, we have evaluated the memory-assisted
compression gain (over the performance of conventional
universal compression) for three cases: $g_M$, $g_{CM}$, and $g_{MDL}$.
Note that $g_{MDL}$ is defined as the gain of memorization with
the MDL clustering of Sec. V. We demonstrate the significance
of memorization through an example, where we again
consider K first-order Markov sources with alphabet size
k = 256, source entropy $H_n(\theta)/n = 1$ bit per byte, and $\epsilon = 0.05$.

Fig. 3 considers the single source case (i.e., K = 1). The
lower bound on the memorization gain is shown as
a function of the sequence length n for different values of
the memory size m. As can be seen, significant improvement
in compression may be achieved using memorization. As
shown in Fig. 3, the memorization gain for a memory
of length m = 8MB is very close to $g(n, \infty, \phi, 0.05, 1)$, and
hence, increasing the memory size beyond 8MB does not
result in a substantial increase of the memorization gain. We
observe that more than 50% improvement is achieved in the
compression performance of a sequence of length n = 128kB
with a memory of m = 8MB. On the other hand, as $n \to \infty$,
the memorization gain becomes negligible, as expected.

For the rest of the experiments, we fixed the length of the
sequences generated by the source, i.e., $n_j = n$ for all j.
We used K = 10 with uniform distribution, i.e., $p_i = 1/K$ for
$i \in [K]$. Further, we performed compression using Context
Tree Weighting (CTW) [3] and averaged the simulation results
over multiple runs of the experiment. Fig. 4 depicts $g_{CM}$. As
can be seen, joint memorization and clustering achieves up to
6-fold improvement over traditional universal compression.
Fig. 5 depicts the experimental $g_{CM}$ using CTW, the theoretical
lower bound on $g_{CM}$ derived in Sec. IV, and the experimental
$g_M$ for memory m = 10MB. As we expected, with no
clustering, memory-assisted compression may result in a
worse compression rate than compression with no memory,
validating our theoretical result in Sec. IV-B. Finally, the
experimental results of the memory-assisted compression gain
$g_{MDL}$ under MDL clustering, summarized in Table I, show
that $g_{MDL}$ is close to $g_{CM}$, demonstrating the effectiveness of
MDL clustering for compression.

In conclusion, this paper demonstrated that memorization 
(i.e., learning the source statistics) can lead to a fundamental 
performance improvement over the traditional universal com- 
pression. This was presented for both single and compound 
sources. We derived theoretical results on the achievable gains 
of memory-assisted source coding for a compound (mixture) 
source and argued that clustering is necessary to obtain mem- 
orization gain for compound sources. We also presented a fast 
MDL clustering algorithm tailored for the compression prob- 
lem at hand and demonstrated its effectiveness for memory- 
assisted compression of finite-length sequences. 